RED WINE DATA ANALYSIS BY MALLIDI AKHIL REDDY

Description

In this project, we apply univariate, bivariate and multivariate analysis to discover patterns among the variables in the red wine data and to identify the chemical properties that influence the quality of red wine.

Summary of the dataset

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

The above results show that the dataset has 1599 rows and that each row has 13 variables. Two of these variables, quality and X, are of type int. X is an identifier that carries a unique value for each observation in the dataset. All remaining variables are of type num.

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

The above results give the minimum, 1st quartile, median, mean, 3rd quartile and maximum for each variable in the dataset.
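The listings above are the kind of output that `names()`, `str()` and `summary()` print for a data frame. The following is a minimal sketch only: it rebuilds the first ten rows of four columns from the `str()` listing above rather than loading the full 1599-row file, and it borrows the data frame name `red_wine` from the correlation output later in this report.

```r
# First ten observations of four columns, copied from the str() output above.
# The full dataset has 1599 rows and 13 columns.
red_wine <- data.frame(
  X                = 1:10,
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

names(red_wine)    # column names
str(red_wine)      # dimensions and the type of each column
summary(red_wine)  # min, quartiles, median, mean and max per column
```

On this ten-row subset the summary statistics will of course differ from the full-dataset values shown above.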

Univariate Plots Section

For the univariate analysis, a plot is created for each individual variable.

The volatile acidity of the samples lies roughly between 0.12 and 1.37, with a few larger values near 1.6.

The fixed acidity of the samples lies between about 4.5 and 15, with a few larger values near 16.

The citric acid of the samples lies between 0 and about 0.81, with a few larger values near 1. A large number of samples have a citric acid value of zero.

The residual sugar of the samples lies between about 0.7 and 9.2, with a few larger values near 13 and 16.

The chlorides of the samples lie roughly between 0.01 and 0.28, with a few larger values near 0.4 and 0.6.

The free.sulfur.dioxide of the samples lies between about 0 and 45, with a few larger values near 50 and 70.

The total.sulfur.dioxide of the samples lies between about 10 and 160, with a few larger values near 300.

The density of the samples lies between about 0.990 and 1.000.

The pH of the samples lies between about 2.75 and 3.75, with a few larger values near 4.0.

The sulphates of the samples lie between about 0.25 and 1.4, with a few larger values near 1.6 and 2.0.

The alcohol of the samples lies between about 8.2 and 14, with a few larger values near 15.

Interestingly, the distribution of quality shows the following pattern:

- Most samples have a quality of 5 or 6.
- Fewer samples have a quality of 7 or 8.
- The fewest samples have a quality of 3 or 4.
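As a rough sketch of how distributions like these can be inspected in base R (the report may well have used a different plotting package), `hist()` bins a continuous variable and `table()` counts the samples per quality score. The vectors below are only the first ten values from the `str()` listing earlier, not the full columns, and the bin edges are chosen for illustration.

```r
# First ten values of two columns, copied from the str() output earlier;
# the full columns have 1599 values each.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Histogram of a continuous variable. plot = FALSE returns the bin counts
# without drawing; drop that argument to see the plot itself.
h <- hist(alcohol, breaks = seq(9, 11, by = 0.5), plot = FALSE)
h$counts

# Number of samples per quality score
quality_counts <- table(quality)
quality_counts
```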

Univariate Analysis

What is the structure of your dataset?

The dataset has 1599 rows and 13 variables.

What is/are the main feature(s) of interest in your dataset?

Since we want to identify the chemical properties that influence the quality of red wine, the main feature of interest is quality.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I think alcohol and the acidity-related properties (citric.acid, volatile.acidity and fixed.acidity) will help the investigation, because these properties may influence the taste of the wine.

Did you create any new variables from existing variables in the dataset?

No, I did not create any new variables in this section.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Quality is concentrated at 5 and 6, whereas very few samples have a quality of 3 or 4.

Bivariate Plots Section

Box plots of quality against each of the other variables were drawn. Among these plots, the clearest relationships are between quality and alcohol, quality and volatile.acidity, quality and citric.acid, and quality and sulphates.

##  poor  good ideal 
##    63  1319   217
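The counts above come from a three-level grouping of quality. The report does not state the cut points, so the sketch below assumes "poor" for quality below 5, "good" for 5 or 6, and "ideal" for 7 and above. Both the cut points and the toy `quality` vector are assumptions for illustration; the name `rating` is taken from the Multivariate Analysis section.

```r
# Hypothetical quality scores for illustration only
quality <- c(3L, 4L, 5L, 5L, 6L, 6L, 7L, 8L)

# Assumed cut points: (-Inf, 4] is poor, (4, 6] is good, (6, Inf] is ideal
rating <- cut(quality,
              breaks = c(-Inf, 4, 6, Inf),
              labels = c("poor", "good", "ideal"))

# Number of samples in each rating level
table(rating)
```

`cut()` returns a factor whose levels follow the order of `labels`, which is why the table prints poor, good, ideal rather than alphabetical order.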

The correlation between quality and each of the remaining variables is computed below, to find the variables that are most strongly correlated with quality.

## 
##  Pearson's product-moment correlation
## 
## data:  red_wine$fixed.acidity and as.numeric(red_wine$quality)
## t = 4.996, df = 1597, p-value = 6.496e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.07548957 0.17202667
## sample estimates:
##       cor 
## 0.1240516
## 
##  Pearson's product-moment correlation
## 
## data:  red_wine$volatile.acidity and as.numeric(red_wine$quality)
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4313210 -0.3482032
## sample estimates:
##        cor 
## -0.3905578
## 
##  Pearson's product-moment correlation
## 
## data:  red_wine$citric.acid and as.numeric(red_wine$quality)
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1793415 0.2723711
## sample estimates:
##       cor 
## 0.2263725
## 
##  Pearson's product-moment correlation
## 
## data:  red_wine$residual.sugar and as.numeric(red_wine$quality)
## t = 0.5488, df = 1597, p-value = 0.5832
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03531327  0.06271056
## sample estimates:
##        cor 
## 0.01373164
## 
##  Pearson's product-moment correlation
## 
## data:  red_wine$chlorides and as.numeric(red_wine$quality)
## t = -5.1948, df = 1597, p-value = 2.313e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.17681041 -0.08039344
## sample estimates:
##        cor 
## -0.1289066
## 
##  Pearson's product-moment correlation
## 
## data:  red_wine$free.sulfur.dioxide and as.numeric(red_wine$quality)
## t = -2.0269, df = 1597, p-value = 0.04283
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.099430290 -0.001638987
## sample estimates:
##         cor 
## -0.05065606
## 
##  Pearson's product-moment correlation
## 
## data:  red_wine$total.sulfur.dioxide and as.numeric(red_wine$quality)
## t = -7.5271, df = 1597, p-value = 8.622e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2320162 -0.1373252
## sample estimates:
##        cor 
## -0.1851003
## 
##  Pearson's product-moment correlation
## 
## data:  red_wine$density and as.numeric(red_wine$quality)
## t = -7.0997, df = 1597, p-value = 1.875e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2220365 -0.1269870
## sample estimates:
##        cor 
## -0.1749192
## 
##  Pearson's product-moment correlation
## 
## data:  red_wine$pH and as.numeric(red_wine$quality)
## t = -2.3109, df = 1597, p-value = 0.02096
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.106451268 -0.008734972
## sample estimates:
##         cor 
## -0.05773139
## 
##  Pearson's product-moment correlation
## 
## data:  log10(red_wine$sulphates) and as.numeric(red_wine$quality)
## t = 12.967, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2636092 0.3523323
## sample estimates:
##       cor 
## 0.3086419
## 
##  Pearson's product-moment correlation
## 
## data:  red_wine$alcohol and as.numeric(red_wine$quality)
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4373540 0.5132081
## sample estimates:
##       cor 
## 0.4761663

Since quality is the main feature of interest, the correlation tests above and the scatter plot matrix show that the variables most strongly correlated with quality are alcohol (0.48), volatile.acidity (-0.39), sulphates (0.31 after a log10 transform) and citric.acid (0.23).
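The correlation tests above can be run in one pass rather than one call per variable. The sketch below is an illustration on randomly generated toy data, not the wine dataset, so its numbers are meaningless; only the pattern of calling `cor.test()` per column and ranking the estimates is the point. The helper names `predictors` and `cors` are made up for this example.

```r
# Hypothetical data for illustration only; not values from the wine dataset
set.seed(1)
n <- 200
red_wine <- data.frame(
  alcohol          = rnorm(n, mean = 10.4, sd = 1),
  volatile.acidity = runif(n, min = 0.1, max = 1.2),
  sulphates        = runif(n, min = 0.3, max = 2)
)
# Make the toy quality score depend on alcohol so one correlation is non-trivial
red_wine$quality <- as.integer(round(
  5.6 + 0.5 * (red_wine$alcohol - 10.4) + rnorm(n, sd = 0.7)
))

# Pearson correlation of each predictor with quality
predictors <- c("alcohol", "volatile.acidity")
cors <- sapply(predictors, function(v) {
  cor.test(red_wine[[v]], as.numeric(red_wine$quality))$estimate[["cor"]]
})

# Sulphates is log10-transformed first, as in the output above
cors["log10(sulphates)"] <- cor.test(log10(red_wine$sulphates),
                                     as.numeric(red_wine$quality))$estimate[["cor"]]

# Sort by absolute strength, strongest first
cors[order(abs(cors), decreasing = TRUE)]
```

Ranking by absolute value keeps strong negative correlations, such as the one for volatile.acidity in the real output, next to strong positive ones.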

The plot of quality against alcohol shows a positive relationship. The plot of quality against volatile acidity shows a negative relationship. The plot of quality against citric acid shows a positive relationship. The plot of quality against sulphates shows a positive relationship.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the

From the correlation matrix and the scatter plot matrix, I found several relationships between quality and the other variables. Some of the positively correlated variables are alcohol, sulphates, citric acid and fixed acidity. Some of the negatively correlated variables are volatile acidity, density and chlorides.

Did you observe any interesting relationships between the other features

From the scatter plot matrix, I observed an interesting relationship between chlorides and residual sugar. The scatter plot between chlorides and residual sugar is below.

What was the strongest relationship you found?

The strongest relationship that I found is between pH and fixed.acidity.

Multivariate Plots Section

From the above plot, it is observed that citric acid and volatile acidity have a negative relationship. From the above plot, it is observed that citric acid and log10(sulphates) have a positive relationship. From the above plot, it is observed that citric acid and alcohol have a positive relationship. From the above plot, it is observed that volatile acidity and log10(sulphates) have a positive relationship. From the above plot, it is observed that volatile acidity and alcohol have a positive relationship. From the above plot, it is observed that sulphates and alcohol have a positive relationship.

Multivariate Analysis

After faceting by rating, good wines tend to have higher citric acid and lower volatile acidity. They also tend to have larger amounts of alcohol and sulphates.
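A numerical complement to the faceted plots is to compare group means per rating level. The sketch below uses made-up values chosen only to mirror the direction described above; the data frame `wine` and its numbers are hypothetical and are not taken from the dataset.

```r
# Hypothetical values for illustration only
wine <- data.frame(
  rating           = factor(c("poor", "poor", "good", "good", "ideal", "ideal"),
                            levels = c("poor", "good", "ideal")),
  citric.acid      = c(0.05, 0.15, 0.20, 0.30, 0.40, 0.50),
  volatile.acidity = c(0.90, 0.70, 0.60, 0.50, 0.40, 0.30)
)

# Mean of each variable within each rating level
by_rating <- aggregate(cbind(citric.acid, volatile.acidity) ~ rating,
                       data = wine, FUN = mean)
by_rating
```

Applied to the real data, a table like this would show whether the means move in the direction the plots suggest as rating goes from poor to ideal.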

Final Plots and Summary

Plot One

Description One

This plot depicts the distribution of wine samples by quality. Most samples have a quality of 5 or 6, whereas few samples have a quality below 5 or above 6.

Plot Two

Description Two

From the above plot, pH surprisingly appears to have almost no relationship with the quality of wine (r = -0.06 in the correlation test above).

Plot Three

Description Three

The above plot shows that good wines contain larger amounts of citric acid. It also shows that a lower amount of volatile acidity is associated with better wine.

Reflection

The dataset contains 1599 samples of red wine.

By performing EDA, I was able to find patterns in the data, such as the features that influence the quality of wine.

First, I went through each variable individually, plotting a histogram for every variable in the dataset to understand its distribution.

Then, from ggpairs and the correlation matrix, I found the four variables that are most strongly correlated with the quality of wine, and I plotted each of these variables against quality.

Multivariate plots between these variables gave a deeper understanding of how they influence the quality of wine. I found that good quality wine tends to have a large amount of citric acid and a small amount of volatile acidity, and that it also tends to have more alcohol and sulphates.

I am not very familiar with chemistry, so I don't have much knowledge of these chemicals, which may have limited my insights into the data.

For future work, I think a dataset with a wider range of variables would help. For example, knowing whether the sweetness of a wine affects its quality would give more insight into the quality of wine.